
    Video Analysis for Understanding Human Actions and Interactions

    Each time we act, our actions are conditioned not only by spatial information, e.g., the objects, people, and scene around us, but also temporally by the actions we have performed before. Indeed, we live in an evolving and dynamic world. To understand what a person is doing, we reason jointly over spatial and temporal information. Intelligent systems that interact with people and perform useful tasks will also require this ability. In light of this need, video analysis has become, in recent years, an essential field in computer vision, providing the community with a wide range of tasks to solve. In this thesis, we make several contributions to the literature on video analysis, exploring different tasks that aim to understand human actions and interactions. We begin by considering the challenging problem of human action anticipation. In this task, we seek to predict a person's action as early as possible before it is completed. This task is critical for applications where machines have to react to human actions. We introduce a novel approach that forecasts the most plausible future human motion by hallucinating motion representations. Then, we address the challenging problem of temporal moment localization, which consists of finding the temporal location of a natural-language query in a long untrimmed video. Although the queries could describe anything happening within the video, the vast majority of them describe human actions. In contrast with propose-and-rank approaches, where methods create or use predefined clips as candidates, we introduce a proposal-free approach that localizes the query by looking at the whole video at once. We also consider the subjectivity of the temporal annotations and propose soft labelling using a categorical distribution centred on the annotated start and end.
Equipped with a proposal-free architecture, we tackle temporal moment localization by introducing a spatial-temporal graph. We found that one limitation of existing methods is the lack of spatial cues involved in the video and the query, i.e., objects and people. We create six semantically meaningful nodes: three that are fed with visual features of people, objects, and activities, and three that capture the language-level relationships "subject-object," "subject-verb," and "verb-object." We use a language-conditional message-passing algorithm to capture the relationships between nodes and create an improved representation of the activity. A temporal graph uses this new representation to determine the start and end of the query. Last, we study the problem of fine-grained opinion mining in video reviews using a multi-modal setting. Video is increasingly used as a source of information for guidance in the shopping process. People use video reviews as a guide to answering what, why, and where to buy something. We tackle this problem using the three modalities inherently present in a video (audio, frames, and transcripts) to determine the most relevant aspect of the product under review and the reviewer's sentiment polarity towards that aspect. We propose an early fusion mechanism that fuses the three modalities at the sentence level. It is a general framework that does not impose any strict constraints on the individual encodings of the audio, video frames, and transcripts.

    A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews

    Despite recent advances in opinion mining for written reviews, few works have tackled the problem for other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine which aspects of the item under review are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video, and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently improves performance over text-only baselines, providing evidence that these extra modalities are key to better understanding video reviews. Comment: Second Grand Challenge and Workshop on Multimodal Language ACL 202

    Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention

    This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a sentence as the query, the goal is to determine the start and end of the visual moment in the video that corresponds to the query sentence. While previous works have tackled this task with a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, and proposal-free approach that relies on three key components: a dynamic filter to transfer language information to the visual domain, a new loss function to guide our model to attend to the most relevant parts of the video, and soft labels to model annotation uncertainty. We evaluate our method on two benchmark datasets, Charades-STA and ActivityNet-Captions. Experimental results show that our approach outperforms state-of-the-art methods on both datasets. Comment: Winter Conference on Applications of Computer Vision 202
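
    The soft labels mentioned in this abstract can be illustrated with a minimal sketch. It assumes a discretised Gaussian kernel around the annotated time step and a cross-entropy comparison; the kernel shape, the `sigma` parameter, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def soft_labels(num_steps, center, sigma=1.0):
    """Categorical distribution over time steps, peaked at `center`.

    Assumption: a Gaussian kernel is used here for illustration only;
    the abstract does not state the exact shape of the distribution.
    """
    weights = [math.exp(-0.5 * ((t - center) / sigma) ** 2)
               for t in range(num_steps)]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy between a predicted and a (soft) target distribution."""
    return -sum(q * math.log(p + eps) for p, q in zip(predicted, target))


# Hypothetical annotation: the moment starts at step 3 and ends at step 7
# of a 10-step video representation.
start_target = soft_labels(num_steps=10, center=3, sigma=1.0)
end_target = soft_labels(num_steps=10, center=7, sigma=1.0)

# A prediction that matches the soft target scores better than a uniform one.
uniform = [1.0 / 10] * 10
loss_match = cross_entropy(start_target, start_target)
loss_uniform = cross_entropy(uniform, start_target)
```

    With a hard one-hot label, a prediction one step away from the annotation is penalised as heavily as one far away; spreading probability mass over neighbouring steps softens that penalty, which is one way to account for annotators disagreeing on exact boundaries.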

    The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose

    The availability of a large labeled dataset is a key requirement for applying deep learning methods to solve various computer vision tasks. In the context of understanding human activities, existing public datasets, while large in size, are often limited to a single RGB camera and provide only per-frame or per-clip action annotations. To enable richer analysis and understanding of human activities, we introduce IKEA ASM, a three-million-frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. Additionally, we benchmark prominent methods for video action recognition, object segmentation, and human pose estimation on this challenging dataset. The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to perform better on these tasks.

    Divide and conquer: Efficient density-based tracking of 3D sensors in Manhattan worlds

    3D depth sensors such as LIDARs and RGB-D cameras have become a popular choice for indoor localization and mapping. However, due to the lack of direct frame-to-frame correspondences, tracking traditionally relies on the iterative closest point technique, which does not scale well with the number of points. In this paper, we build on more recent and efficient density distribution alignment methods and push the idea towards a highly efficient and reliable solution for full 6DoF motion estimation using only depth information. We propose a divide-and-conquer technique in which the estimation of the rotation and of the three translational degrees of freedom are all decoupled from one another. The rotation is estimated absolutely and drift-free by exploiting the orthogonal structure of man-made environments. The underlying algorithm is an efficient extension of the mean-shift paradigm to manifold-constrained multiple-mode tracking. Dedicated projections subsequently enable the estimation of the translation through three simple 1D density alignment steps that can be executed in parallel. An extensive evaluation on both simulated and publicly available real datasets, comparing against several existing methods, demonstrates outstanding performance at low computational cost. The research leading to these results is supported by the Australian Centre for Robotic Vision. The work is furthermore supported by ARC grant DE150101365. Yi Zhou acknowledges the financial support from the China Scholarship Council for his Ph.D. Scholarship No. 201406020098.
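
    The idea of a 1D density alignment step can be sketched roughly as follows. This is not the paper's algorithm: as a stand-in it uses plain histograms and a brute-force cross-correlation search along a single axis, and every name, bin width, and data point below is a hypothetical choice for illustration.

```python
def histogram(values, lo, width, bins):
    """Count how many values fall into each of `bins` bins of size `width`."""
    counts = [0] * bins
    for v in values:
        index = int((v - lo) / width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts


def best_shift(ref, cur, max_shift):
    """Return the bin shift s that maximises sum_i ref[i] * cur[i + s]."""
    best, best_score = 0, -1
    for s in range(-max_shift, max_shift + 1):
        score = sum(ref[i] * cur[i + s]
                    for i in range(len(ref))
                    if 0 <= i + s < len(cur))
        if score > best_score:
            best, best_score = s, score
    return best


# Hypothetical data: point coordinates along one axis, clustered on two
# "walls"; the second frame sees the same structure displaced by 0.5 m.
width = 0.1
ref_points = [1.05] * 3 + [4.05] * 2
cur_points = [v + 0.5 for v in ref_points]

ref_density = histogram(ref_points, lo=0.0, width=width, bins=60)
cur_density = histogram(cur_points, lo=0.0, width=width, bins=60)

shift_bins = best_shift(ref_density, cur_density, max_shift=10)
shift_metres = shift_bins * width
```

    Because each axis is treated independently once the rotation is known, three such searches (one per axis) could run in parallel, which is the property the abstract highlights.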

    The Road to the WTO twelfth Ministerial Conference: A Latin American and Caribbean perspective

    The context in which international food trade takes place has changed considerably since the last Ministerial Conference (MC11) in 2017. Significant progress has not been achieved in many important issues that are still pending on the organization’s agenda. Moreover, geopolitical changes and the Covid-19 pandemic have drastically impacted the institutional priorities of countries and the WTO itself. The global economy has substantially deteriorated over the past two years, with structural impacts in the areas of trade and food security, particularly for Latin America and the Caribbean (LAC). The multilateral trading system and its main organization, the WTO, have come under attack and are being discredited. The possibility of advancing towards coordinated solutions to major global issues through multilateral cooperation seems unlikely. Countries have adopted a wide range of strategic decisions to respond to the effects of this situation on international trade and agriculture. Many have revised their trade policies to adjust them to different scenarios with respect to food security and agricultural trade flows. The surge in commodity prices and a fear of food shortages have led some governments to apply restrictive measures that limit or tax agricultural exports. Other measures adopted include direct market interventions through public stock holdings, special safeguard mechanisms, and state trading enterprises. The adoption of these measures has triggered new debates on their effectiveness in reducing food insecurity and propelling the development of fair and transparent food markets. Regulations such as sustainability standards, access restrictions or domestic support measures must be transparent and aligned with WTO principles to avoid discretionary applications and discriminatory practices. Information transparency is key to access and develop new markets, especially under growing environmental scrutiny.
Effective market access is crucial, not only for the development of agro-exporting countries (which prioritize this issue on their development agendas) but also for importing countries, as a means of guaranteeing food security and connecting main suppliers with buyers in regions facing food shortages. The WTO dispute settlement mechanism has become a strategic asset for developing countries, enabling them to continue expanding their agricultural exports and strengthening their position in the market. However, the current state of paralysis of the WTO Appellate Body has recently affected the institution’s effectiveness in regulating and arbitrating conflicts in the area of food trade relations. Most importantly, the growth strategy of Latin American countries depends on the WTO and the legal order that it enforces; therefore, actively contributing to its modernization and prioritizing its success as part of their trade and foreign policies is of crucial importance.